Choice of Filter Order in LPC Analysis of Vowels

Authors

  • Gautam Vallabha
  • Betty Tuller
Abstract

In studies of vowel perception and production, the formants of the signal are usually estimated using Linear Predictive Coding (LPC) analysis. The order of the LPC filter is typically estimated by starting with a heuristic value (e.g. the sampling frequency in kHz) and adjusting it in a trial-and-error manner. This method is adequate when analyzing a few sounds, but it is cumbersome for studies involving large vowel corpora, and it is error-prone, since under- and over-estimation of the filter order can both lead to inaccurate formant estimates (Vallabha & Tuller (2002), Speech Comm. 38(1-2), 141-160). We present a “reflection coefficient cutoff” (RCC) heuristic that can be used to determine quickly the “best” filter order for either a corpus of vowels or a single vowel. The heuristic is as follows: (1) Pick a set of representative analysis frames from the vowel corpus and compute their reflection coefficients. (2) Choose the “best” filter order to be the smallest order past which a majority of the reflection coefficients are less than a pre-defined cutoff. The cutoff value was determined empirically, but we show that it is related to general model selection criteria such as the Minimum Description Length. In addition, we show that the heuristic can reliably mark the distinction between different sampling rates and male versus female speakers so that the filter order can be adjusted appropriately.

INTRODUCTION

Linear Predictive Coding (LPC) analysis is a robust method for estimating the formant locations of a speech sound (Markel & Gray, 1976). While the algorithm is straightforward in principle, in practice it requires careful adjustment of several parameters, such as the width and shape of the windowing function, the degree of preemphasis, and the order of the filter.
The formant estimates can be especially distorted by three factors: the quantization of the speech spectrum due to the fundamental frequency, an incorrect choice of filter order, and incorrect estimation of the peaks of the LPC spectrum (Vallabha & Tuller, 2002). Of these three factors, the choice of filter order is the most crucial. If the filter order is too low, the formant peaks are smeared or averaged; if it is too high, the estimated formant locations are biased towards the F0 harmonics. In the worst case, an inappropriate filter order can lead to spurious formant peaks or to formants being missed altogether. The estimation bias is particularly an issue in studies of vowel perception and production that involve detailed formant analyses (e.g. Repp & Williams, 1987). In this paper, we give some evidence for interspeaker variation in filter order and present a heuristic to shorten the trial-and-error process of estimating the best filter order for vowel corpora.

Vallabha et al., Choice of filter order. From Sound to Sense: June 11 – June 13, 2004 at MIT, C-204.

Evidence for interspeaker differences in optimal filter order

The effect of filter order in LPC analysis can be best seen using the PARCOR (partial correlation coefficient) formulation of the autocorrelation method (Rabiner & Schafer, 1978). If s(1), s(2), etc. are the successive samples, then k_m is the correlation between s(n) and s(n–m) after the influence of the intervening samples is factored out. Being a correlation coefficient, k_m (also called a reflection coefficient) can vary between –1 and +1 and has two useful properties: (1) the LPC coefficients for a p-order filter can be calculated directly from k_1 ... k_p, and (2) the difference between the power spectra of the p-order and the (p+1)-order filters is proportional to |k_{p+1}| (Markel & Gray, 1976, Sec. 6.2.5).
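Property (1) can be made concrete with the Levinson-Durbin recursion, which produces the reflection coefficients and the predictor coefficients in one pass over the autocorrelation sequence. The following is a minimal Python sketch, not the authors' implementation. The function name `levinson_durbin` and the sign convention (prediction s(n) ≈ Σ a_i s(n–i), so that k_1 = r(1)/r(0)) are assumptions made for illustration; some texts negate k_m.

```python
def levinson_durbin(r, order):
    """Return (ks, a, err) for autocorrelation values r[0..order].

    ks  : reflection (PARCOR) coefficients k_1..k_order
    a   : predictor coefficients a_1..a_order, with s(n) ~ sum(a_i * s(n-i))
    err : prediction error of the final filter
    The sign convention is an assumption; some texts negate k_m.
    """
    if len(r) <= order:
        raise ValueError("need autocorrelation lags 0..order")
    if r[0] <= 0:
        raise ValueError("r[0] must be positive (non-silent frame)")
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:
            break                    # perfectly predictable; stop early
        # Correlation at lag m not explained by the order-(m-1) predictor
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err
        ks.append(k)
        # Step-up recursion: order-m coefficients from order-(m-1) ones and k_m
        prev = a[:]
        a[m] = k
        for i in range(1, m):
            a[i] = prev[i] - k * prev[m - i]
        err *= (1.0 - k * k)
    return ks, a[1:], err
```

For the autocorrelation sequence r = [1, 0.5, 0.25, 0.125] of a first-order process, the sketch gives k_1 = 0.5 and k_2 = k_3 = 0, so raising the filter order beyond 1 leaves the filter unchanged.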
To examine the relation between filter order and formant estimates, we used natural imitations of 56 synthetic vowel-like stimuli, which elicit a wide variety of vowel qualities. The stimuli were synthesized using the Klatt synthesizer and had three steady-state formants, with F1 ranging from 300 to 750 Hz in steps of ~65 Hz, F2 from 1700 Hz to 2400 Hz in steps of ~115 Hz, and F3 constant at 2500 Hz. For all stimuli, F0 was 120 Hz and the duration was 200 ms.

Next, we asked three speakers of American English (two adult males, AM1 and AM2, and one adult female, AF1) to imitate each of the 56 synthetic stimuli. In addition, each speaker read a list of [hVd] words ([V] ranging over the monophthongal vowels of English). All utterances were digitally recorded at a sampling rate of 10 kHz.

For the LPC analysis, each natural utterance was divided into 256-point pitch-asynchronous frames, which were preemphasized at 100% and Hamming-windowed (Markel & Gray, 1976, Sec. 6.5). Next, a frame was chosen from the middle of the utterance and k_1 ... k_20 were calculated. These were used to derive the LPC filters for different orders. At each filter order, the formant locations were estimated by taking a 512-point Discrete Fourier Transform of the LPC coefficients and estimating the locations of its peaks using three-point parabolic interpolation (Markel & Gray, 1976, Sec. 7.2.2). The peak locations from contiguous analysis frames (covering at least 5 pitch pulses) were averaged to get the final formant estimates.

Figure 1 shows the average reflection coefficients and corresponding changes in the formant estimates for the 56 utterances. Note that when the filter order is changed from p to p+1, the magnitude of the formant change is proportional to the magnitude of the mean k_m. Past a certain "critical" filter order, the mean k_m values (and also the formant changes) are close to 0 with inconsistent sign (the critical filter orders for AM1 and AM2 are around 13 and 11, respectively). This pattern suggests that increases in filter order beyond the critical order have little effect on F1 and F2. Thus, the critical filter order for a corpus of utterances is a good estimate of their optimal filter order (p_opt).

Figure 1: Relation between reflection coefficients and formant changes for speakers AM1 and AM2. Blue lines show the average reflection coefficient k as a function of the lag m. Red lines show the change in formant estimates when going from a (p–1)-order filter to a p-order filter (solid red line: F1, dashed red line: F2). Vertical black dotted lines mark the approximate critical filter order.

A HEURISTIC TO ESTIMATE THE OPTIMAL FILTER ORDER

Based on observations from data such as in Figure 1, Vallabha and Tuller (2002) proposed the following “reflection coefficient cutoff” (RCC) heuristic for estimating the optimal filter order for a single utterance sampled at 10 kHz: (1) Pick a representative 256-point analysis frame and compute k_1 ... k_20. (2) Let p_min, the minimum filter order, be twice the expected number of formants. Then, pick p_opt to be the smallest p > p_min such that |k_{p+1}| and |k_{p+2}| are both less than 0.15. The heuristic is based on the observation that the contribution of k_m to the power spectrum of the LPC filter is proportional to |k_m|, so it is safe to ignore “insignificantly” small values of k_m.

Effectiveness

The above heuristic was applied to each of the 116 utterances of AM1, AM2 and AF1 (56 imitations of the synthesized sounds + 10 productions each of 3 back and 3 front vowels). For about 15% of the utterances, the sharp cutoff (0.15) caused the heuristic to overestimate p_opt by one or two coefficients. In these cases, the heuristic was overridden and p_opt was manually identified. Figure 2a shows the histogram of the resulting filter orders.
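The single-utterance RCC rule stated above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the authors' Matlab code. The function name `rcc_order`, the list layout (`ks[0]` holds k_1), and the choice to return `None` when no order qualifies are assumptions. The example coefficients are made up for illustration and are not measured data.

```python
def rcc_order(ks, p_min, cutoff=0.15):
    """Smallest p > p_min with |k_{p+1}| and |k_{p+2}| both below cutoff.

    ks[0] holds k_1, so k_m is ks[m - 1]. Returns None when no order
    qualifies within the supplied coefficients (an assumed convention).
    """
    for p in range(p_min + 1, len(ks) - 1):
        # ks[p] is k_{p+1} and ks[p + 1] is k_{p+2}
        if abs(ks[p]) < cutoff and abs(ks[p + 1]) < cutoff:
            return p
    return None


# Illustrative magnitudes only, not measured data: k_1 .. k_20
example_ks = [0.9] * 10 + [0.5, 0.3, 0.1, 0.05] + [0.02] * 6
p_opt = rcc_order(example_ks, p_min=6)   # p_min = 2 * 3 expected formants
```

With these made-up values, k_13 and k_14 are the first consecutive pair below 0.15 past p_min, so `p_opt` is 12.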
The locations of the histogram peaks for AM1 and AM2 match the critical filter orders seen in Figure 1, suggesting that the heuristic is reasonably effective in identifying the critical filter order and hence p_opt. The spread of the histogram peaks indicates the variability in the utterances, which can stem from differences in vowel type, pitch, and breathiness.

To characterize the effect of the sampling frequency, we recorded reproductions of the 56 synthesized sounds by a single 13-year-old male speaker. All utterances were originally sampled at 20 kHz and then downsampled to 15 kHz and 10 kHz. At each sampling rate, the optimal filter order for each utterance was estimated using the RCC heuristic (cutoff = 0.15), using an analysis window of 256 points for 10 and 15 kHz and 512 points for 20 kHz. The theoretical MDL analysis (see below) suggests that the cutoff should decrease as the window size increases. In practice, however, the 0.15 cutoff is fairly robust across the sampling frequencies and analysis windows used for speech. Figure 2b shows the resulting distribution of the filter orders, which suggests that the RCC heuristic can compensate fairly well for changes in the sampling frequency.

Figure 2: Histograms of optimal filter orders estimated using the RCC heuristic. (a) Interspeaker differences. Histogram for the 56 utterances of speakers AM1, AM2, and AF1. (b) Sampling rate adjustment. Histogram for the 56 utterances of a 13-year-old male (F0 ≈ 250 Hz) sampled at different rates.

The RCC heuristic is best used with individual utterances. This is cumbersome when dealing with a large corpus of productions by a single speaker, and a pragmatic workaround is to identify a single filter order that is effective for a majority of a speaker’s utterances.
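One way to pick a single order for a whole corpus is to apply a majority vote at each lag, following the corpus version of the heuristic given in the abstract. The Python sketch below is one possible interpretation: it requires that, at every lag beyond p, more than half of the representative frames have |k_m| below the cutoff. The function name `corpus_rcc_order`, the "every lag beyond p" reading, and the `None` return value are assumptions, not details taken from the paper.

```python
def corpus_rcc_order(frames_ks, p_min, cutoff=0.15):
    """Smallest p > p_min past which most frames have small |k_m| at every lag.

    frames_ks is a list of per-frame coefficient lists, each with ks[0] = k_1.
    Returns None if no order qualifies (an assumed convention).
    """
    n_frames = len(frames_ks)
    if n_frames == 0:
        return None
    max_lag = min(len(ks) for ks in frames_ks)

    def majority_small(m):
        # True when more than half of the frames have |k_m| below the cutoff
        small = sum(1 for ks in frames_ks if abs(ks[m - 1]) < cutoff)
        return 2 * small > n_frames

    for p in range(p_min + 1, max_lag):
        if all(majority_small(m) for m in range(p + 1, max_lag + 1)):
            return p
    return None
```

A stricter or looser vote (for example, checking only the next two lags, as in the single-utterance rule) would be an equally defensible variant; the choice here favors a conservative order for the corpus.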
To facilitate the selection of a speaker-level filter order, we have developed a suite of Matlab programs to calculate the mean k_m curves, determine the filter order, and find the peaks of the LPC spectrum using fine-resolution peak-picking (Vallabha & Tuller, 2002). The suite can be downloaded from the web (Vallabha, 2000).

Relation of RCC to general system identification methods

General system identification methods attempt to identify the filter order past which the decrease in prediction error ceases to be worth the increase in model complexity. One interpretation of this tradeoff is that a signal cannot be efficiently compressed past the optimal filter order. Since incompressible signals are equivalent to noise, the optimal order marks the noise level of the signal. Two common identification methods are the Akaike information criterion (AIC) and the minimum description length (MDL) (Rissanen, 1978; Kay, 1988). The equation for MDL is:

$$\mathrm{MDL}(p) = N \ln r + p \ln N \tag{1}$$

where N is the number of samples in the analysis window, r is the mean squared prediction error, and p is the filter order. For autoregressive models computed using the autocorrelation method,
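Equation (1) can be connected to a reflection-coefficient cutoff under one standard assumption that is not taken from the text above: the Levinson-Durbin error update r_p = r_{p-1}(1 – k_p²). Under that assumption, MDL(p) – MDL(p–1) = N ln(1 – k_p²) + ln N, so raising the order to p lowers the MDL only when |k_p| exceeds sqrt(1 – N^(–1/N)). The Python sketch below evaluates this; the function names are hypothetical.

```python
import math


def mdl_curve(ks, n_samples, r0=1.0):
    """MDL(p) = N ln r_p + p ln N for p = 0..len(ks).

    Assumes r_p = r0 * prod(1 - k_m^2), the Levinson-Durbin error update.
    """
    values = [n_samples * math.log(r0)]
    r = r0
    for p, k in enumerate(ks, start=1):
        r *= (1.0 - k * k)
        values.append(n_samples * math.log(r) + p * math.log(n_samples))
    return values


def mdl_order(ks, n_samples):
    """Filter order that minimizes the MDL curve."""
    curve = mdl_curve(ks, n_samples)
    return min(range(len(curve)), key=curve.__getitem__)


def implied_cutoff(n_samples):
    """|k_p| above which raising the order to p lowers the MDL."""
    return math.sqrt(1.0 - n_samples ** (-1.0 / n_samples))
```

Under this assumption the implied cutoff is about 0.146 for a 256-point window and about 0.110 for a 512-point window, which is in line with the empirical 0.15 value and with the remark that the cutoff should decrease as the window size increases.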


Similar resources

An adaptive MEL-LPC analysis for speech recognition

This paper describes a new speech analysis method, an adaptive Mel-LPC (AMLPC) analysis method, using human auditory characteristics. The Mel-LPC analysis method that we have proposed is an efficient time domain technique to estimate the warped predictors from input speech directly. However, the frequency resolution of spectrum obtained by Mel-LPC analysis is constant regardless of the characte...


A data reduction method to estimate vowel distributions and its use in comparing two formant estimation methods

Speech features such as formants of vowels uttered by many talkers are considered to form a normal distribution in each phoneme on a feature space. However, those features may apparently show the different dispersions peculiar to the estimation methods. Therefore, if the correct distributions can be found by a credible method, it will make clear the definition of feature estimation errors so th...


Structural (phonetic) evaluation of dissimilarities functions used in speech recognition

We have evaluated 17 variants of 6 dissimilarities: PLOMP, Log Likelihood Ratio, Cepstrum, Mel Frequency Cepstrum Coefficients, Weighted Slope Metric, and Spectral Peaks Adjustment, derived from FFT and/or LPC analysis with two types of integration (KLATT and ZWICKER). We used as "references" synthetic and natural vocalic stimuli for which we have a phonetic structural representation. The inte...


Thai monophthong recognition using continuous density hidden Markov model and LPC cepstral coefficients

This paper presents Thai monophthong recognition. The monophthongs were qualitatively recognized by the 3-state left-to-right continuous density hidden Markov model. The LPC cepstral coefficients were used as the feature representing the speech signal. The temporal cepstral derivative was additionally utilized in order to compare the efficiency of the additional feature with that of the single LPC cep...


Acoustic Analysis of Persian EFL Learners' Pronunciation of English Vowels

This paper reports the results of an experimental study on non-native production of English vowels. Two groups of Persian EFL learners varying in language proficiency were tested on their ability to produce the nine plain vowels of American English. Vowel production accuracy was assessed by means of acoustic measurements. Ladefoged and Maddison’s (1996) F1 F2 measurements for American English v...




Publication date: 2004